Fix/arc e2g vroom segfault#102
Draft
kaybrand wants to merge 4 commits intoEngreitzLab:mainfrom
Draft
Conversation
… segfault
genomation::readGeneric() calls vroom::vroom(altrep=lazy), which uses
mmap() to lazily map files into memory. On Lustre (OAK), lock
revocation or client cache eviction can invalidate mmap'd pages,
causing SIGSEGV with SEGV_ACCERR in the vroom C++ layer — a hardware
fault below R's managed memory that cannot be caught or recovered from.
Replace both readGeneric() calls with fread() + makeGRangesFromDataFrame():
- bed.narrowPeak: fread(select=1:3) reads only chr/start/end,
matching readGeneric's default keep.all.metadata=FALSE behavior.
- bed.allPutative: fread(select=c("chr","start","end","TargetGene"))
reads only the 4 columns actually used (findOverlaps coordinates +
TargetGene assignment at line 33), skipping ~20 unused ABC columns.
This recovers the practical benefit of vroom's lazy column parsing
without mmap.
Both reads now use starts.in.df.are.0based=TRUE to correctly convert
0-based BED [a, b) to 1-based inclusive GRanges [a+1, b]. The old code
loaded BED start directly as a 1-based GRanges start, which elongated
every region by 1bp and caused spurious findOverlaps hits at region
boundaries.
Output coordinate preservation:
- PeakName construction uses start(gr) - 1L to emit the original
0-based BED start.
- df.pairs.e2g$start is decremented by 1 before fwrite, restoring
0-based BED format in the output TSV.
- On-disk output is format-identical to previous runs.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ap segfault
Same vroom mmap segfault fix as make_kendall_pairs.R (see prior commit
for full root cause description).
Replace readGeneric() with fread(select=c("chr","start","end","PeakName"))
+ makeGRangesFromDataFrame(starts.in.df.are.0based=TRUE). Only the 4
columns actually used are read; TargetGene and PairName are skipped.
Replace library(genomation) with library(data.table) for fread().
ATAC matrix rowname fix:
Signac::FeatureMatrix derives rownames via GRangesToString, which
emits 1-based GRanges starts. Under the old (buggy) code, the
GRanges 1-based start numerically equalled the 0-based BED start,
so rownames accidentally matched the 0-based PeakName strings in
Pairs.Kendall.tsv.gz. With the corrected coordinate loading
(starts.in.df.are.0based=TRUE), GRanges starts are now bed_start+1,
so default Signac rownames would be chr-(a+1)-b while PeakName is
chr-a-b.
After FeatureMatrix, rownames are explicitly set to
paste(seqnames, start-1L, end, sep="-") to restore 0-based BED
naming. FeatureMatrix preserves input feature order, so this is a
safe positional rename. compute_kendall.R matches
mcols(bed.E2G)[,"PeakName"] %in% rownames(data.ATAC), so the two
must agree on coordinate convention.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…gfault Same vroom mmap segfault fix as prior commits (see make_kendall_pairs commit for full root cause description). Replace readGeneric(kendall_pairs_path, keep.all.metadata=T, header=T) with makeGRangesFromDataFrame(fread(kendall_pairs_path), keep.extra.columns=TRUE, starts.in.df.are.0based=TRUE). All columns from kendall_pairs are used in this script, so no select= narrowing. Remove library(genomation) from imports. Output coordinate preservation: df.pairs.E2G$start is decremented by 1 before fwrite, restoring 0-based BED format in Pairs.Kendall.tsv.gz. This file is consumed by compute_arc_e2g.R, which expects 0-based BED coordinates. PeakName is carried through from the input file's metadata column (not reconstructed), so it remains in its original 0-based format. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…gfault This is the script that originally triggered the investigation: 1 of 4 scE2G runs segfaulted in vroom_() C++ code during the arc_e2g rule (SIGSEGV with SEGV_ACCERR, "invalid permissions"). Root cause: vroom's mmap-based lazy reading on Lustre, where lock revocation invalidates mmap'd pages below R's memory management layer. Replace readGeneric(abc_predictions_path) with makeGRangesFromDataFrame(fread(abc_predictions_path), keep.extra.columns=TRUE, starts.in.df.are.0based=TRUE). All ABC columns are used (AddMax, IntegrateABC), so no select= narrowing. Replace GRanges(fread(kendall_predictions_path)) with makeGRangesFromDataFrame(fread(kendall_predictions_path), keep.extra.columns=TRUE, starts.in.df.are.0based=TRUE). The old GRanges(fread()) call also misinterpreted 0-based BED starts. Remove library(genomation) from imports. Coordinate correction: The old code loaded BED [a, b) as GRanges [a, b] (treating the 0-based start as 1-based). This elongated every region by 1bp, causing findOverlaps in AddMax to produce spurious hits where two regions share a boundary but don't overlap in BED semantics. Quantified on igvf10/thp1_2: 1,502 of 10,303,415 ABC rows (0.015%) were affected; max ARC.E2G.Score ratio old/new = 2.05; max absolute score change = 0.0045. Output coordinate preservation: df.output$start is decremented by 1 before fwrite, restoring 0-based BED format in EnhancerPredictionsAllPutative_ARC.tsv.gz. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Collaborator
Author
|
We will come back to this at a later date. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
readGeneric() from the genomation package uses vroom_() for lazy loading. vroom uses a memory map that was implemented in C++. On a distributed system, threads may find themselves using an outdated memory map. This patch prevents a race condition that caused a segmentation fault when one thread tried to read from a memory address that another thread had just freed and the OS threw an error to prevent data corruption.
Four scripts previously using readGeneric are modified to use makeGRangesFromDataFrame(fread(path_to_data) instead. The data are 0-indexed, requiring us to tell makeGRangesFromDataFrame() that starts.in.df.are.0based = TRUE. GRanges prefers 1-indexed data, so it adds 1 to the start position as it reads in the data. Adjusted scripts must also subtract 1 from the start position before writing data to maintain the use of 0-indexed convention in scE2G.
1-indexed: [start, end]
0-indexed: [start, end)